PM Accelerator Tech Assessment: Weather Trend Forecasting¶

Objective:¶

I used GlobalWeatherRepository.csv to predict future weather patterns and demonstrate data science skills using a mix of foundational and advanced methods. This dataset provides daily weather data for cities worldwide, featuring over 40 attributes that capture global weather conditions. Available on Kaggle as the Global Weather Repository, it contains 47,162 records and 41 columns, including variables such as temperature, precipitation, wind speed, and air quality metrics.

Data Cleaning and Preprocessing¶

In this phase, I perform data cleaning and preprocessing by handling missing values, removing duplicates, and detecting and removing outliers using the IQR method. Additionally, histograms visualize the distribution of the numeric data before and after cleaning, ensuring the dataset is clean, consistent, and ready for further analysis.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [84]:
df = pd.read_csv("GlobalWeatherRepository.csv")
pd.set_option('display.max_columns', None)
df.head()
Out[84]:
country location_name latitude longitude timezone last_updated_epoch last_updated temperature_celsius temperature_fahrenheit condition_text wind_mph wind_kph wind_degree wind_direction pressure_mb pressure_in precip_mm precip_in humidity cloud feels_like_celsius feels_like_fahrenheit visibility_km visibility_miles uv_index gust_mph gust_kph air_quality_Carbon_Monoxide air_quality_Ozone air_quality_Nitrogen_dioxide air_quality_Sulphur_dioxide air_quality_PM2.5 air_quality_PM10 air_quality_us-epa-index air_quality_gb-defra-index sunrise sunset moonrise moonset moon_phase moon_illumination
0 Afghanistan Kabul 34.52 69.18 Asia/Kabul 1715849100 2024-05-16 13:15 26.6 79.8 Partly Cloudy 8.3 13.3 338 NNW 1012.0 29.89 0.0 0.00 24 30 25.3 77.5 10.0 6.0 7.0 9.5 15.3 277.0 103.0 1.1 0.2 8.4 26.6 1 1 04:50 AM 06:50 PM 12:12 PM 01:11 AM Waxing Gibbous 55
1 Albania Tirana 41.33 19.82 Europe/Tirane 1715849100 2024-05-16 10:45 19.0 66.2 Partly cloudy 6.9 11.2 320 NW 1012.0 29.88 0.1 0.00 94 75 19.0 66.2 10.0 6.0 5.0 11.4 18.4 193.6 97.3 0.9 0.1 1.1 2.0 1 1 05:21 AM 07:54 PM 12:58 PM 02:14 AM Waxing Gibbous 55
2 Algeria Algiers 36.76 3.05 Africa/Algiers 1715849100 2024-05-16 09:45 23.0 73.4 Sunny 9.4 15.1 280 W 1011.0 29.85 0.0 0.00 29 0 24.6 76.4 10.0 6.0 5.0 13.9 22.3 540.7 12.2 65.1 13.4 10.4 18.4 1 1 05:40 AM 07:50 PM 01:15 PM 02:14 AM Waxing Gibbous 55
3 Andorra Andorra La Vella 42.50 1.52 Europe/Andorra 1715849100 2024-05-16 10:45 6.3 43.3 Light drizzle 7.4 11.9 215 SW 1007.0 29.75 0.3 0.01 61 100 3.8 38.9 2.0 1.0 2.0 8.5 13.7 170.2 64.4 1.6 0.2 0.7 0.9 1 1 06:31 AM 09:11 PM 02:12 PM 03:31 AM Waxing Gibbous 55
4 Angola Luanda -8.84 13.23 Africa/Luanda 1715849100 2024-05-16 09:45 26.0 78.8 Partly cloudy 8.1 13.0 150 SSE 1011.0 29.85 0.0 0.00 89 50 28.7 83.6 10.0 6.0 8.0 12.5 20.2 2964.0 19.0 72.7 31.5 183.4 262.3 5 10 06:12 AM 05:55 PM 01:17 PM 12:38 AM Waxing Gibbous 55
In [4]:
df['last_updated'] = pd.to_datetime(df['last_updated'])
In [5]:
pd.isnull(df).sum()
Out[5]:
country                         0
location_name                   0
latitude                        0
longitude                       0
timezone                        0
last_updated_epoch              0
last_updated                    0
temperature_celsius             0
temperature_fahrenheit          0
condition_text                  0
wind_mph                        0
wind_kph                        0
wind_degree                     0
wind_direction                  0
pressure_mb                     0
pressure_in                     0
precip_mm                       0
precip_in                       0
humidity                        0
cloud                           0
feels_like_celsius              0
feels_like_fahrenheit           0
visibility_km                   0
visibility_miles                0
uv_index                        0
gust_mph                        0
gust_kph                        0
air_quality_Carbon_Monoxide     0
air_quality_Ozone               0
air_quality_Nitrogen_dioxide    0
air_quality_Sulphur_dioxide     0
air_quality_PM2.5               0
air_quality_PM10                0
air_quality_us-epa-index        0
air_quality_gb-defra-index      0
sunrise                         0
sunset                          0
moonrise                        0
moonset                         0
moon_phase                      0
moon_illumination               0
dtype: int64

I check for missing values in the dataset by counting the number of null entries in each column; none are missing.

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47162 entries, 0 to 47161
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   country                       47162 non-null  object        
 1   location_name                 47162 non-null  object        
 2   latitude                      47162 non-null  float64       
 3   longitude                     47162 non-null  float64       
 4   timezone                      47162 non-null  object        
 5   last_updated_epoch            47162 non-null  int64         
 6   last_updated                  47162 non-null  datetime64[ns]
 7   temperature_celsius           47162 non-null  float64       
 8   temperature_fahrenheit        47162 non-null  float64       
 9   condition_text                47162 non-null  object        
 10  wind_mph                      47162 non-null  float64       
 11  wind_kph                      47162 non-null  float64       
 12  wind_degree                   47162 non-null  int64         
 13  wind_direction                47162 non-null  object        
 14  pressure_mb                   47162 non-null  float64       
 15  pressure_in                   47162 non-null  float64       
 16  precip_mm                     47162 non-null  float64       
 17  precip_in                     47162 non-null  float64       
 18  humidity                      47162 non-null  int64         
 19  cloud                         47162 non-null  int64         
 20  feels_like_celsius            47162 non-null  float64       
 21  feels_like_fahrenheit         47162 non-null  float64       
 22  visibility_km                 47162 non-null  float64       
 23  visibility_miles              47162 non-null  float64       
 24  uv_index                      47162 non-null  float64       
 25  gust_mph                      47162 non-null  float64       
 26  gust_kph                      47162 non-null  float64       
 27  air_quality_Carbon_Monoxide   47162 non-null  float64       
 28  air_quality_Ozone             47162 non-null  float64       
 29  air_quality_Nitrogen_dioxide  47162 non-null  float64       
 30  air_quality_Sulphur_dioxide   47162 non-null  float64       
 31  air_quality_PM2.5             47162 non-null  float64       
 32  air_quality_PM10              47162 non-null  float64       
 33  air_quality_us-epa-index      47162 non-null  int64         
 34  air_quality_gb-defra-index    47162 non-null  int64         
 35  sunrise                       47162 non-null  object        
 36  sunset                        47162 non-null  object        
 37  moonrise                      47162 non-null  object        
 38  moonset                       47162 non-null  object        
 39  moon_phase                    47162 non-null  object        
 40  moon_illumination             47162 non-null  int64         
dtypes: datetime64[ns](1), float64(23), int64(7), object(10)
memory usage: 14.8+ MB

Getting a summary of the dataframe, specifically the number of non-null values and the data type of each column.

In [7]:
print(df.duplicated().sum())
0

This checks how many duplicate rows are in the dataframe; there are none.

In [8]:
def remove_outliers_IQR(df):
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    df_clean = df.copy()
    outlier_counts = {}

    for col in numeric_cols:
        # Compute the interquartile range for this column
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Count outliers before removal
        outliers_before = len(df_clean[(df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)])

        # Drop rows falling outside the IQR bounds
        df_clean = df_clean[(df_clean[col] >= lower_bound) & (df_clean[col] <= upper_bound)]

        # Verify no outliers remain after removal
        outliers_after = len(df_clean[(df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)])

        outlier_counts[col] = (outliers_before, outliers_after)

    return df_clean, outlier_counts

cleaned_data, outlier_counts = remove_outliers_IQR(df)
for column, (outliers_before, outliers_after) in outlier_counts.items():
    print(f'{column}: Outliers before removal: {outliers_before}, Outliers after removal: {outliers_after}')
latitude: Outliers before removal: 0, Outliers after removal: 0
longitude: Outliers before removal: 3628, Outliers after removal: 0
last_updated_epoch: Outliers before removal: 0, Outliers after removal: 0
temperature_celsius: Outliers before removal: 1530, Outliers after removal: 0
temperature_fahrenheit: Outliers before removal: 1077, Outliers after removal: 0
wind_mph: Outliers before removal: 509, Outliers after removal: 0
wind_kph: Outliers before removal: 31, Outliers after removal: 0
wind_degree: Outliers before removal: 0, Outliers after removal: 0
pressure_mb: Outliers before removal: 1914, Outliers after removal: 0
pressure_in: Outliers before removal: 1652, Outliers after removal: 0
precip_mm: Outliers before removal: 7247, Outliers after removal: 0
precip_in: Outliers before removal: 0, Outliers after removal: 0
humidity: Outliers before removal: 0, Outliers after removal: 0
cloud: Outliers before removal: 0, Outliers after removal: 0
feels_like_celsius: Outliers before removal: 402, Outliers after removal: 0
feels_like_fahrenheit: Outliers before removal: 84, Outliers after removal: 0
visibility_km: Outliers before removal: 3796, Outliers after removal: 0
visibility_miles: Outliers before removal: 0, Outliers after removal: 0
uv_index: Outliers before removal: 0, Outliers after removal: 0
gust_mph: Outliers before removal: 193, Outliers after removal: 0
gust_kph: Outliers before removal: 22, Outliers after removal: 0
air_quality_Carbon_Monoxide: Outliers before removal: 2380, Outliers after removal: 0
air_quality_Ozone: Outliers before removal: 202, Outliers after removal: 0
air_quality_Nitrogen_dioxide: Outliers before removal: 3104, Outliers after removal: 0
air_quality_Sulphur_dioxide: Outliers before removal: 1880, Outliers after removal: 0
air_quality_PM2.5: Outliers before removal: 1322, Outliers after removal: 0
air_quality_PM10: Outliers before removal: 1234, Outliers after removal: 0
air_quality_us-epa-index: Outliers before removal: 2078, Outliers after removal: 0
air_quality_gb-defra-index: Outliers before removal: 1477, Outliers after removal: 0
moon_illumination: Outliers before removal: 0, Outliers after removal: 0

The IQR method drops rows whose values fall more than 1.5×IQR below the first quartile or above the third quartile of each numeric column. After filtering every numeric column in turn, 0 outliers remain in the cleaned dataset.
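Because dropping rows discards entire records, an alternative treatment is to cap (winsorize) values at the IQR bounds with pandas `clip` instead of filtering them out; a minimal sketch on a toy series, separate from the function above:

```python
import pandas as pd

# Hypothetical toy series with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the IQR bounds instead of dropping rows
capped = s.clip(lower=lower, upper=upper)
print(capped.max() <= upper, len(capped) == len(s))
```

Capping preserves the row count, which matters later when columns are filtered one after another and each pass can shrink the frame.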

In [85]:
# Iterate over df's own numeric columns, so columns added later to
# cleaned_data (e.g. date parts) cannot raise a KeyError here
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure(figsize=(12, 6))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col} (Before cleaning)')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
[Histograms of each numeric column before cleaning]

This code plots a histogram for each numeric column to show how the values are distributed before cleaning. It groups the data into 30 bins and overlays a KDE curve to show the overall shape of each distribution.

In [86]:
for col in cleaned_data.columns:
    if cleaned_data[col].dtype != 'object': 
        plt.figure(figsize=(12, 6))
        sns.histplot(cleaned_data[col], kde=True, bins=30) 
        plt.title(f'Distribution of {col} (After cleaning)')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.show()
[Histograms of each numeric column after cleaning]

This code generates histograms with KDE curves for each numeric column in cleaned_data to visualize the distributions after outlier removal.
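No explicit normalization step appears in the notebook itself; if one were wanted before modeling, a minimal column-wise min-max scaling sketch (on a hypothetical toy frame) looks like this:

```python
import pandas as pd

# Hypothetical toy frame standing in for the numeric columns
df_toy = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "humidity": [40.0, 60.0, 80.0]})

# Min-max normalization to [0, 1], applied column-wise
normalized = (df_toy - df_toy.min()) / (df_toy.max() - df_toy.min())
print(normalized)
```

The same effect can be had with scikit-learn's `MinMaxScaler`; the arithmetic version makes the transformation explicit.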

Basic EDA¶

In my EDA process, I create visualizations like line plots to observe trends in temperature, precipitation, and other variables, and scatter plots to show relationships. I also use correlation matrices to analyze the connections between weather and air quality data.

In [12]:
plt.figure(figsize=(10, 6))
sns.lineplot(x='last_updated', y='temperature_celsius', data=cleaned_data, color='orange')
plt.title('Temperature Trends Over Time')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(10, 6))
sns.lineplot(x='last_updated', y='precip_mm', data=cleaned_data, color='hotpink')
plt.title('Precipitation Trends Over Time')
plt.xticks(rotation=45)
plt.show()
[Line plots: temperature and precipitation trends over time]

I created line plots to show the trends of temperature and precipitation over time, using last_updated as the x-axis. The orange graph shows temperature in Celsius rising in the summer and falling in the winter, while the pink graph shows spikes of rain followed by dry periods.
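With many overlapping timestamps per day across cities, aggregating to daily means before plotting usually yields cleaner trend lines; a small sketch using pandas `resample` on hypothetical hourly data:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly observations spanning three calendar days
idx = pd.date_range("2024-05-16", periods=72, freq="h")
obs = pd.DataFrame({"temperature_celsius": np.linspace(10, 20, 72)}, index=idx)

# Collapse to one mean value per day before plotting the trend
daily = obs["temperature_celsius"].resample("D").mean()
print(len(daily))  # 3 daily averages
```

The resulting `daily` series can be passed straight to `sns.lineplot`, which no longer has to average thousands of points per tick itself.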

In [14]:
import plotly.express as px

list_of_countries = ['Portugal', 'Argentina', 'Brazil', 'Germany']

trend = cleaned_data.query("country in @list_of_countries")


fig = px.line(
    trend,
    x='last_updated',
    y='precip_mm',
    title="Precipitation Trends by Country",
    color="country",
    markers=True,
    hover_data=['precip_mm']
)


fig.update_xaxes(title="Timestamp of Observation")
fig.update_yaxes(title="Precipitation (mm)")
fig.update_traces(marker=dict(size=6), line=dict(width=2))
fig.update_layout(
    legend_title=dict(text="Country"),
    height=600,
    width=1000,
    title_font=dict(size=20)
)


fig.show()

Argentina, Germany, and Portugal show relatively low precipitation levels, with sporadic high spikes in some months, whereas Brazil shows a significant spike in precipitation around January 2025.

In [15]:
list_of_countries = ['Portugal', 'Argentina', 'Brazil', 'Germany']


trend = cleaned_data.query("country in @list_of_countries")


fig = px.line(
    trend,
    x='last_updated',
    y='temperature_fahrenheit',
    title="Temperature Trends by Country",
    color="country",
    markers=True,
    hover_data=['temperature_fahrenheit']
)


fig.update_xaxes(title="Timestamp of Observation")
fig.update_yaxes(title="Temperature (°F)")
fig.update_traces(marker=dict(size=6), line=dict(width=2))
fig.update_layout(
    legend_title=dict(text="Country"),
    height=600,
    width=1000,
    title_font=dict(size=20)
)


fig.show()

The line graph illustrates the temperature trends of Argentina, Germany, Portugal, and Brazil from June 2024 to January 2025: Argentina, Germany, and Portugal fluctuate significantly, while Brazil shows a steady decrease over the same period.

In [16]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=cleaned_data['wind_kph'], y=cleaned_data['gust_kph'], color = 'hotpink')
plt.title('Wind Speed vs Gust Speed')
plt.xlabel('Wind Speed (kph)')
plt.ylabel('Gust Speed (kph)')
plt.show()
[Scatter plot: wind speed vs gust speed]

The scatter plot reveals a clear positive linear relationship between wind speed and gust speed, indicating that higher wind speeds are associated with stronger gusts. The close clustering of data points suggests a strong correlation, emphasizing wind speed as a significant factor in gust intensity.
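That visual impression can be quantified with the Pearson correlation coefficient; a small sketch on synthetic wind/gust values (the 1.4 gust factor is an illustrative assumption, not a value from the dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
wind = rng.uniform(5, 40, size=200)                   # synthetic wind speeds (kph)
gust = wind * 1.4 + rng.normal(0, 2, size=200)        # gusts roughly track wind

# Pearson correlation coefficient between the two series
r = np.corrcoef(wind, gust)[0, 1]
print(round(r, 2))  # close to 1 for a strong linear relationship
```

On the real columns, `cleaned_data['wind_kph'].corr(cleaned_data['gust_kph'])` computes the same statistic directly.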

In [17]:
correlation_matrix = cleaned_data[['temperature_celsius', 'humidity', 'wind_kph', 'precip_mm', 'pressure_mb']].corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Weather Variables')
plt.show()
[Heatmap: correlation matrix of weather variables]

Overall, the correlation matrix suggests that temperature and humidity as well as pressure and temperature have the strongest relationship amongst this group of features, and precipitation and humidity have a moderate positive relationship. The other variables do not appear to be strongly correlated with each other.

In [18]:
air_quality_cols = ['air_quality_PM2.5', 'air_quality_PM10', 'air_quality_Carbon_Monoxide', 'air_quality_Nitrogen_dioxide']
correlation_matrix_air_quality = cleaned_data[air_quality_cols].corr()

plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix_air_quality, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Air Quality Indicators')
plt.show()
[Heatmap: correlation matrix of air quality indicators]

Air Quality PM2.5 and Air Quality PM10 have a strong positive correlation (0.75). This suggests that when PM2.5 levels are high, PM10 levels are also likely to be high.

Advanced and Basic Model Building/Forecasting¶

I built three models—gradient boosting, random forest, and linear regression—to predict precipitation trends. I evaluated each model using MAE, RMSE, and R2, and then created an ensemble of their predictions to improve accuracy, as part of the basic and advanced forecasting requirements.

In [29]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

cleaned_data = cleaned_data.copy()  # avoid SettingWithCopyWarning on the filtered frame
cleaned_data['last_updated'] = pd.to_datetime(cleaned_data['last_updated'])
cleaned_data['year'] = cleaned_data['last_updated'].dt.year
cleaned_data['month'] = cleaned_data['last_updated'].dt.month
cleaned_data['day'] = cleaned_data['last_updated'].dt.day
cleaned_data['hour'] = cleaned_data['last_updated'].dt.hour
cleaned_data['minute'] = cleaned_data['last_updated'].dt.minute
cleaned_data['weekday'] = cleaned_data['last_updated'].dt.weekday 

cleaned_data = cleaned_data.drop(columns=['last_updated'])
features = cleaned_data.drop(columns=['precip_mm', 'anomaly'], errors='ignore')
target = cleaned_data['precip_mm']
features = pd.get_dummies(features, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)


models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Linear Regression": LinearRegression()
}

performance = {}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test) 
    predictions[name] = preds 
    performance[name] = {
        "MAE": mean_absolute_error(y_test, preds),
        "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
        "R2 Score": r2_score(y_test, preds)
    }

ensemble_preds = np.mean(np.column_stack(list(predictions.values())), axis=1)
performance["Ensemble"] = {
    "MAE": mean_absolute_error(y_test, ensemble_preds),
    "RMSE": np.sqrt(mean_squared_error(y_test, ensemble_preds)),
    "R2 Score": r2_score(y_test, ensemble_preds)
}
print("Model Performance:")
for model_name, scores in performance.items():
    print(f"{model_name}: MAE={scores['MAE']:.3f}, RMSE={scores['RMSE']:.3f}, R2 Score={scores['R2 Score']:.3f}")
Model Performance:
Gradient Boosting: MAE=0.006, RMSE=0.012, R2 Score=0.398
Random Forest: MAE=0.005, RMSE=0.011, R2 Score=0.411
Linear Regression: MAE=0.020, RMSE=0.164, R2 Score=-120.266
Ensemble: MAE=0.010, RMSE=0.056, R2 Score=-13.010

This code trains three different models (gradient boosting, random forest, and linear regression) to predict precipitation using features from a timestamp, then evaluates each model's performance with metrics like MAE, RMSE, and R2. It also combines the predictions from all models into an ensemble and checks how the ensemble performs compared to the individual models.
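One caveat: `train_test_split` shuffles rows at random, so for time-ordered data later observations can leak into training. A chronological alternative, sketched here on toy arrays with scikit-learn's `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)   # stand-in for time-ordered features
y = np.arange(20, dtype=float)

tscv = TimeSeriesSplit(n_splits=3)
ok = True
for train_idx, test_idx in tscv.split(X):
    # In every fold, all training indices precede all test indices
    ok = ok and train_idx.max() < test_idx.min()
print("chronological folds verified:", ok)
```

Swapping this splitter into the model loop above would give a fairer estimate of how each model forecasts unseen future precipitation.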

Advanced EDA¶

I used Isolation Forest to detect outliers in temperature and precipitation data. After identifying the outliers, I visualized them in plots with the outliers marked in red for temperature and orange for precipitation, as part of the advanced EDA for anomaly detection.

In [30]:
from sklearn.ensemble import IsolationForest

numeric_columns = df.select_dtypes(include=[np.number]).columns

iso_forest = IsolationForest(contamination=0.05, random_state=42)

df['outlier'] = iso_forest.fit_predict(df[numeric_columns])

outliers = df[df['outlier'] == -1]

print(f"Total number of detected outliers: {len(outliers)}")

fig, (ax_temp, ax_precip) = plt.subplots(2, figsize=(12, 10))

ax_temp.plot(df.index, df['temperature_fahrenheit'], label='Temperature', color='blue')
ax_temp.scatter(outliers.index, outliers['temperature_fahrenheit'], color='red', label='Outliers')
ax_temp.set_title("Outliers in Temperature Data")
ax_temp.set_xlabel("Time")
ax_temp.set_ylabel("Temperature (°F)")
ax_temp.legend()

ax_precip.plot(df.index, df['precip_mm'], label='Precipitation', color='green')
ax_precip.scatter(outliers.index, outliers['precip_mm'], color='orange', label='Outliers')
ax_precip.set_title("Outliers in Precipitation Data")
ax_precip.set_xlabel("Time")
ax_precip.set_ylabel("Precipitation (mm)")
ax_precip.legend()

plt.tight_layout()
plt.show()
Total number of detected outliers: 2359
[Plots of temperature and precipitation over time with detected outliers highlighted]

I applied the Isolation Forest method to find outliers in the temperature and precipitation data, then plotted each series over time, marking outliers in red for temperature and orange for precipitation.

Advanced Climate Analysis:¶

I selected countries from different continents and plotted their temperature trends over time. Each plot shows the temperature changes for each country, helping analyze long-term climate patterns in various regions.

In [89]:
countries_to_plot = {
    "North America": ["United States of America", "Canada", "Mexico"],
    "South America": ["Brazil", "Argentina", "Haiti"],
    "Europe": ["Germany", "France", "Italy"],
    "Africa": ["Nigeria", "South Africa", "Egypt"],
    "Asia": ["Sri Lanka", "North Korea", "Vietnam"],
    "Oceania": ["Australia", "New Zealand"]
}

for continent, countries in countries_to_plot.items():
    subset_data = df[df['country'].isin(countries)]
    
    plt.figure(figsize=(12, 6))
    sns.lineplot(x='last_updated', y='temperature_celsius', hue='country', data=subset_data)
    plt.title(f'Long-Term Climate Patterns in {continent}')
    plt.xlabel('Date')
    plt.ylabel('Temperature (°C)')
    plt.legend(title='Country', loc='upper right')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
[Temperature trend plots, one per continent]

I selected countries from each continent and plotted their temperature trends over time. Each plot shows how temperature has changed, with different colors for each country.

Advanced Environmental Impact Analysis:¶

I analyzed the relationship between air quality and weather parameters, such as temperature and humidity. The scatter plots show weak or no strong correlation between temperature and carbon monoxide levels, and humidity and PM2.5 levels.

In [79]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='temperature_celsius', y='air_quality_Carbon_Monoxide', data=cleaned_data, color = "teal")
plt.title('Temperature vs Carbon Monoxide Levels')
plt.show()
[Scatter plot: temperature vs carbon monoxide levels]

The scatter plot shows the relationship between temperature and carbon monoxide levels, and the data suggests that there is no strong correlation between the two variables.

In [77]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='humidity', y='air_quality_PM2.5', data=cleaned_data, alpha=0.7, color='purple')
plt.title('Humidity vs PM2.5 Levels')
plt.xlabel('Humidity (%)')
plt.ylabel('PM2.5 Levels')
plt.tight_layout()
plt.show()
[Scatter plot: humidity vs PM2.5 levels]

The scatter plot shows the relationship between humidity and PM2.5 levels, and the data suggests that there is no strong correlation between the two variables.

Advanced Feature Importance¶

I used a RandomForestRegressor to assess feature importance in predicting temperature. The bar plot shows how each feature (precipitation, humidity, wind speed, and pressure) contributes to the model's prediction of temperature

In [78]:
features = cleaned_data[['precip_mm', 'humidity', 'wind_kph', 'pressure_mb']]
target = cleaned_data['temperature_celsius']

model = RandomForestRegressor()
model.fit(features, target)

importance = pd.Series(model.feature_importances_, index=features.columns)
importance.plot(kind='bar', title='Key Features Influencing Temperature')
plt.show()
[Bar plot: feature importances for predicting temperature]

Humidity has the strongest influence on temperature, wind speed and pressure have moderate influence, and precipitation has the weakest.
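Impurity-based importances from a random forest can be biased toward high-variance features; permutation importance is a common cross-check. A hedged sketch on synthetic data where only the first feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                          # three synthetic features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)    # only feature 0 matters

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean.argmax())  # feature 0 should dominate
```

Applying `permutation_importance` to the fitted model above would confirm whether humidity's lead survives a shuffle-based test.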

Advanced Spatial Analysis:¶

I visualized global wind speed patterns by plotting wind speed data on a world map. I used latitude and longitude coordinates to place windspeed data points, with colors showing wind speed variations across different regions.

In [91]:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(15, 10))
map_plot = Basemap(projection='cyl', resolution='l', 
                   llcrnrlat=-90, urcrnrlat=90, 
                   llcrnrlon=-180, urcrnrlon=180)

map_plot.drawcoastlines()
map_plot.drawcountries()

lats = cleaned_data['latitude']
lons = cleaned_data['longitude']
temps = cleaned_data['wind_kph']

scatter = map_plot.scatter(lons, lats, c=temps, cmap='coolwarm', marker='o', s=40, alpha=0.7, zorder=5)
plt.colorbar(scatter, label='Wind Speed (kph)')
plt.title('Global Wind Patterns')
plt.show()
[World map: wind speed observations colored by speed]

The map shows that wind speeds tend to be higher in the Northern Hemisphere, with some areas exceeding 30 kph, whereas the Southern Hemisphere generally sees lower speeds, mostly between 10 and 20 kph, reflecting differences in air circulation between the two hemispheres.

Advanced Geographical Patterns:¶

I generated an interactive global temperature heatmap using Plotly. It shows temperature variations across the world, with yellow representing the hottest areas and purple the coldest regions, highlighting high temperatures in parts of Africa, the Middle East, and Australia.

In [81]:
import plotly.express as px

fig = px.scatter_geo(
    cleaned_data,
    lat='latitude',
    lon='longitude',
    color='temperature_celsius',
    hover_name='location_name',
    projection='natural earth',
    title='Global Temperature Heatmap'
)

fig.update_geos(showcoastlines=True, coastlinecolor="black", showland=True, landcolor="white")

fig.show()

The heatmap displays global temperatures, with yellow indicating the hottest areas primarily in the tropics, while purple represents the coldest regions. It is interesting to note the high temperatures in parts of Africa, the Middle East, and Australia.